Context

Reading data from a software version control system can be pretty useful if you want to answer some evolutionary questions like

  • Who are our main committers to the software?
  • Are there any areas in the code where only one developer knows of?
  • Where were we working on the last months?

In my previous notebook, I showed you how to read a Git repository directly in Python with Pandas and GitPython. As much as I like that approach (because everything is in one place and therefore reproducible), it's (currently) very slow while reading all the statistics information (but I'll work on that!). What I want to have now is a really fast method to read in a complete Git repository.

I take this opportunity to show you how to read any kind of structure, linear data into Pandas' DataFrame. The general rule of thumb is: As long as you see a pattern in the raw data, Pandas can read and tame it, too!

The idea

We are taking a shortcut for retrieving the commit history by exporting it into a log file. You can use e. g.

git log --all --numstat --pretty=format:'--%h--%ad--%aN' --no-renames > git.log 
to do this. This will output a file with all the log information of a repository.

In this notebook, we analyze the Git repository of aim42 (an open book project about how to improve legacy systems).

The first entries of that file look something like this:


In [1]:
with open (r'data/gitlog_aim42.log') as log:
    [print(line, end='') for line in log.readlines()[:8]]


--ea7e08b--Tue Nov 29 21:42:16 2016 +0100--feststelltaste
2	0	src/main/asciidoc/appendices/bibliography.adoc
1	7	src/main/asciidoc/pattern-index.adoc
12	1	src/main/asciidoc/patterns/improve/anticorruption-layer.adoc

--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste
2	0	src/main/asciidoc/appendices/bibliography.adoc
2	2	src/main/asciidoc/patterns/analyze/busfactor.adoc

For each commit, we choose to create a header line with the following commit info (by using --pretty=format:'--%h--%ad--%aN'):

--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste
It contains the SHA key, the timestamp as well as the author's name of the commit, separated by --. For each other row, we got some statistics about the modified files:
2   0   src/main/asciidoc/appendices/bibliography.adoc

It contains the number of lines inserted, the number of lines deleted and the relative path of the file. With a little trick and a little bit of data wrangling, we can read that information into a nicely structured DataFrame.

Let's get started!

Import the data

First, I'll show you my approach on how to read nearly everything into a DataFrame. The key is to use Pandas' read_csv for reading "non-character separated values". How to do that? We simply choose a separator that doesn't occur in the file that we want to read. My favorite character for this is the "DEVICE CONTROL TWO" character U+0012. I haven't encountered a situation yet where this character was included in a data set.

We just read our git.log file without any headers (because there are none) and give the only column a nice name.


In [2]:
import pandas as pd

commits = pd.read_csv("data\gitlog_aim42.log", 
                      sep="\u0012", 
                      header=None, 
                      names=['raw'])
commits.head()


Out[2]:
raw
0 --ea7e08b--Tue Nov 29 21:42:16 2016 +0100--fes...
1 2\t0\tsrc/main/asciidoc/appendices/bibliograph...
2 1\t7\tsrc/main/asciidoc/pattern-index.adoc
3 12\t1\tsrc/main/asciidoc/patterns/improve/anti...
4 --fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--fes...

Data Wrangling

OK, but now we have a problem data wrangling challenge. We have the commit info as well as the statistic for the modified file in one column, but they don't belong together. What we want is to have the commit info along with the file statistics in separate columns to get some serious analysis started.

Commit info

Let's treat the commit info first. Luckily, we set some kind of anchor or marker to identify the commit info: Each commit info starts with a --. So let's extract all the commit info from the original commits DataFrame.


In [3]:
commit_marker = commits[
    commits['raw'].str.startswith("--")]
commit_marker.head()


Out[3]:
raw
0 --ea7e08b--Tue Nov 29 21:42:16 2016 +0100--fes...
4 --fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--fes...
7 --c3d4e2d--Thu Dec 22 05:47:32 2016 +0100--Dr....
8 --3f793e8--Tue Nov 29 21:42:16 2016 +0100--fes...
12 --5d297c9--Wed Dec 21 20:49:33 2016 +0100--fes...

With this, we can focus on extracting the information of a commit info row. The next command could be looking a little frightening, but don't worry. We go through it step by step.


In [4]:
commit_info = commit_marker['raw'].str.extract(
        r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$", 
        expand=True)   
commit_info['date'] = pd.to_datetime(commit_info['date'])
commit_info.head()


Out[4]:
sha date author
0 ea7e08b 2016-11-29 20:42:16 feststelltaste
4 fa1ca6f 2016-12-22 07:04:18 feststelltaste
7 c3d4e2d 2016-12-22 04:47:32 Dr. Gernot Starke
8 3f793e8 2016-11-29 20:42:16 feststelltaste
12 5d297c9 2016-12-21 19:49:33 feststelltaste

We want to extract some data from the raw column. For this, we use the extract method on the string representation (note the str) of all the rows. This method expects a regular expression. We provide our own regex

^--(?P<sha>.\*?)--(?P<date>.\*?)--(?P<author>.\*?)$
that works as follows:

  • ^: the beginning of the row
  • --: the two dashes that we choose and are used in the git log file as separator between the entries
  • (?P<sha>.*?)--: a named match group (marked by the ( and ) ) with the name sha for all characters (.*) until the next occurrence (?) of the -- separators.
  • and so on until
  • \$: the marker for the end of the row (actually, ^ and $ aren't needed, but it looks nicer from a regex string's perspective in my eyes ;-) )

I use these ugly looking, named match groups because then the name of such a group will be used by Pandas for the name of the column (therefore we avoid renaming the columns later on).

The expand=True keyword delivers a DataFrame with columns for each detected regex group.

We simply store the result into a new DataFrame variable commit_info.

Because we've worked with the string representation of the row, Pandas didn't recognize the right data types for our newly created columns. That's why we need to cast the date column to the right type.

OK, this part is ready, let's have a look at the file statistics!

File statistics

Every row that is not a commit info row is a file statistics row. So we just reuse the index of our already prepared commit_info DataFrame to get all the other data by saying "give me all commits that are not in the index of the commit_info's DataFrame".


In [5]:
file_stats_marker = commits[
    ~commits.index.isin(commit_info.index)]
file_stats_marker.head()


Out[5]:
raw
1 2\t0\tsrc/main/asciidoc/appendices/bibliograph...
2 1\t7\tsrc/main/asciidoc/pattern-index.adoc
3 12\t1\tsrc/main/asciidoc/patterns/improve/anti...
5 2\t0\tsrc/main/asciidoc/appendices/bibliograph...
6 2\t2\tsrc/main/asciidoc/patterns/analyze/busfa...

Luckily, the row's data is just a tab-separated string that we can easily split with the split method. We expand the result to get a DataFrame , rename the default columns to something that make more sense and adjust some data types. For the later, we use the keyword coerce that will let to_numeric return Nan's for all entries that are not a number.


In [6]:
file_stats = file_stats_marker['raw'].str.split(
    "\t", expand=True)
file_stats = file_stats.rename(
    columns={ 0: "insertions", 1: "deletions", 2: "filename"})
file_stats['insertions'] = pd.to_numeric(
    file_stats['insertions'], errors='coerce')
file_stats['deletions'] = pd.to_numeric(
    file_stats['deletions'], errors='coerce')
file_stats.head()


Out[6]:
insertions deletions filename
1 2.0 0.0 src/main/asciidoc/appendices/bibliography.adoc
2 1.0 7.0 src/main/asciidoc/pattern-index.adoc
3 12.0 1.0 src/main/asciidoc/patterns/improve/anticorrupt...
5 2.0 0.0 src/main/asciidoc/appendices/bibliography.adoc
6 2.0 2.0 src/main/asciidoc/patterns/analyze/busfactor.adoc

Putting it all together

Now we have three parts: all commits, the separated commit info and the file statistics.

We only need to glue the commit info and the file statistics together into a normalized DataFrame. For this, we have to make some adjustments to the indexes.

For the commit info, we want to have each info for each file statistics row. That means we reindex the commit info by using the index of the commits DataFrame...


In [7]:
commit_info.reindex(commits.index).head(3)


Out[7]:
sha date author
0 ea7e08b 2016-11-29 20:42:16 feststelltaste
1 NaN NaT NaN
2 NaN NaT NaN

...and fill the missing values for the file statistics' rows to get the needed structure. Together, this is done like the following:


In [8]:
commit_data = commit_info.reindex(
    commits.index).fillna(method="ffill")
commit_data.head()


Out[8]:
sha date author
0 ea7e08b 2016-11-29 20:42:16 feststelltaste
1 ea7e08b 2016-11-29 20:42:16 feststelltaste
2 ea7e08b 2016-11-29 20:42:16 feststelltaste
3 ea7e08b 2016-11-29 20:42:16 feststelltaste
4 fa1ca6f 2016-12-22 07:04:18 feststelltaste

After filling the file statistics rows, we can throw away the dedicated commit info rows by reusing the index from above (look at the index for seeing this clearly).


In [9]:
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
commit_data.head()


Out[9]:
sha date author
1 ea7e08b 2016-11-29 20:42:16 feststelltaste
2 ea7e08b 2016-11-29 20:42:16 feststelltaste
3 ea7e08b 2016-11-29 20:42:16 feststelltaste
5 fa1ca6f 2016-12-22 07:04:18 feststelltaste
6 fa1ca6f 2016-12-22 07:04:18 feststelltaste

The easy step afterward is to join the file_stats DataFrame with the commit_data.


In [10]:
commit_data = commit_data.join(file_stats)
commit_data.head()


Out[10]:
sha date author insertions deletions filename
1 ea7e08b 2016-11-29 20:42:16 feststelltaste 2.0 0.0 src/main/asciidoc/appendices/bibliography.adoc
2 ea7e08b 2016-11-29 20:42:16 feststelltaste 1.0 7.0 src/main/asciidoc/pattern-index.adoc
3 ea7e08b 2016-11-29 20:42:16 feststelltaste 12.0 1.0 src/main/asciidoc/patterns/improve/anticorrupt...
5 fa1ca6f 2016-12-22 07:04:18 feststelltaste 2.0 0.0 src/main/asciidoc/appendices/bibliography.adoc
6 fa1ca6f 2016-12-22 07:04:18 feststelltaste 2.0 2.0 src/main/asciidoc/patterns/analyze/busfactor.adoc

We're done!

Complete code block

To much code to look through? Here is everything from above in a condensed format.


In [11]:
%%time
import pandas as pd

commits = pd.read_csv(r'C:\dev\repos\aim42\git.log', sep="\u0012", header=None, names=['raw'])

commit_marker = commits[commits['raw'].str.startswith("--",na=False)]
commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$", expand=True)
commit_info['date'] = pd.to_datetime(commit_info['date'])

file_stats_marker = commits[~commits.index.isin(commit_info.index)]
file_stats = file_stats_marker['raw'].str.split("\t", expand=True)
file_stats = file_stats.rename(columns={0: "insertions", 1: "deletions", 2: "filename"})
file_stats['insertions'] = pd.to_numeric(file_stats['insertions'], errors='coerce')
file_stats['deletions'] = pd.to_numeric(file_stats['deletions'], errors='coerce')

commit_data = commit_info.reindex(commits.index).fillna(method="ffill")
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
commit_data = commit_data.join(file_stats)


Wall time: 219 ms

Just some milliseconds to run through, not bad!

Summary

In this notebook, I showed you how to read some non-perfect structured data via the non-character separator trick. I also showed you how to transform the rows that contain multiple kinds of data into one nicely structured DataFrame.

Now that we have the Git repository DataFrame, we can do some nice things with it e. g. visualizing the code churn of a project, but that's a story for another notebook! But to give you a short preview:


In [12]:
%matplotlib inline
timed_commits = commit_data.set_index(pd.DatetimeIndex(commit_data['date']))[['insertions', 'deletions']].resample('1D').sum()
(timed_commits['insertions'] - timed_commits['deletions']).cumsum().fillna(method='ffill').plot()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x27a49010978>

Stay tuned!